Abstract: A large fraction of an XML document typically 
consists of text data. The XPath query language allows text 
search via the equal, contains, and starts-with predicates. Such 
predicates can efficiently be implemented using a compressed 
self-index of the document's text nodes. Most queries, however, 
combine conditions on the text of the document with conditions 
on its tree structure. It is therefore a 
challenge to choose an appropriate evaluation order for a given 
query, which optimally leverages the execution speeds of the text 
and tree indexes. Here the SXSI system is introduced; it stores the 
tree structure of an XML document using a bit array of opening 
and closing brackets, and stores the text nodes of the document 
using a global compressed self-index. On top of these indexes 
sits an XPath query engine that is based on tree automata. The 
engine uses fast counting queries of the text index in order to 
dynamically determine whether to evaluate top-down or bottom-up 
with respect to the tree structure. The resulting system has 
several advantages over existing systems: (1) on pure tree queries 
(without text search) such as the XPathMark queries, the SXSI 
system performs on par with or better than the fastest known systems 
MonetDB and Qizx, (2) on queries that use text search, SXSI 
outperforms the existing systems by 1-3 orders of magnitude 
(depending on the size of the result set), and (3) with respect to 
memory consumption, SXSI outperforms all other systems for 
counting-only queries. 

I. Introduction 

As more and more data is stored, transmitted, queried, 
and manipulated in XML form, the popularity of XPath 
and XQuery as languages for querying semi-structured data 
continues to grow. Solving those queries efficiently has proved to 
be quite challenging, and has triggered much research. Today 
there is a wealth of public and commercial XPath/XQuery 
engines, apart from several theoretical proposals. 

In this paper we focus on XPath, which is simpler and forms 
the basis of XQuery. XPath query engines can be roughly 
divided into two categories: sequential and indexed. In the 
former, which follows a streaming approach, no preprocessing 
of the XML data is necessary. Each query must sequentially 
read the whole collection, and the goal is to be as close as 
possible to making just one pass over the data, while using as 
little main memory as possible to hold intermediate results and 
data structures. In contrast, the indexed approach preprocesses the 
XML collection to build a data structure on it, so that later 
queries can be solved without traversing the whole collection. 
A serious challenge of the indexed approach is that the index 
can use much more space than the original data, and thus may 
have to be manipulated on disk. Indexed schemes access the 
data more randomly than sequential ones, thus given the way 
disk costs favor sequential accesses, an indexed scheme can be 
outperformed by a streaming one even if the former accesses 
a relatively small fraction of the data. This is especially true in 
applications where the data itself can fit in main memory but 
the index cannot, so that streaming approaches do not need to 
access the disk at all. Those applications are becoming more 
common as current main memories become capable of holding 
a few gigabytes of XML data. Examples of such systems are 
Qizx/DB [1], MonetDB/XQuery [2] and Tauro [3]. 

In this work we aim at an index for XML that uses little 
space compared to the size of the data, so that the indexed 
collection can fit in main memory for moderate-sized data, 
thereby solving XPath queries without any need of resorting 
to disk. An in-memory index should outperform streaming 
approaches, even when the data fits in RAM. Note that 
usually, main memory XML query systems (such as Saxon [4], 
Galax [5], Qizx/Open [1], etc.) use machine pointers to 
represent XML data; this blows up the memory consumption 
to about 5-10 times the size of the original XML document. 

An XML collection can be regarded essentially as a text 
collection (that is, a set of strings) organized into a tree 
structure, so that the strings correspond to the text data and the 
tree structure corresponds to the nesting of tags. The problem 
of manipulating text collections within compressed space is 
now well understood [6]-[8], and also much work has been 
carried out on compact data structures for trees [9]-[13]. In 
this paper we show how both types of compact data structures 
can be integrated into a compressed index representation for 
XML data, which is able to efficiently solve XPath queries. 

A feature inherited from its components is that the com- 
pressed index replaces the XML collection, in the sense that 
the data (or any part of it) can be efficiently reproduced from 
the index (and thus the data itself can be discarded). The result 
is called a self-index, as the data is inextricably tied to its 
index. A self-index for XML data was recently proposed [14], 
[15], yet its support for XPath is reduced to a very limited 
class of queries that are handled particularly well. 

The main value of our work is to provide the first practical 
and public tool for compressed indexing of XML data, dubbed 
Succinct XML Self-Index (SXSI), which takes little space, 
solves a significant portion of XPath (currently we support 
at least Core XPath [16], i.e., all navigational axes, plus the 
three text predicates = (equality), contains, and starts-with), 
and largely outperforms the best public software supporting 
XPath we are aware of, namely MonetDB and Qizx. The 
main challenges in achieving our results have been to obtain 
practical implementations of compact data structures (for texts, 
trees, and others) that are at a theoretical stage, to develop 
new compact schemes tailored to this particular problem, and 
to develop query processing strategies tuned for the specific 
cost model that emerges from the use of these compact data 
structures. The limitations of our scheme are that it is in- 
memory (this is a basic design decision, actually), that it is 
static (i.e., the index must be rebuilt when the XML data 
changes), and that it does not handle XQuery. The last two 
limitations are the subject of future work. 

II. Basic Concepts and Model 

We regard an XML collection as (i) a set of strings and 
(ii) a labeled tree. The latter is the natural XML parse tree 
defined by the hierarchical tags, where the (normalized) tag 
name labels the corresponding node. We add a dummy root 
so that we have a tree instead of a forest. Moreover, each text 
node is represented as a leaf labeled #. Attributes are handled 
as follows in this model. Each node with attributes receives a 
single child labeled @, and for each attribute @attr=value 
of the node, we add a child labeled attr to its @-node, and a 
leaf child labeled % to the attr-node. The text content value 
is then associated to that leaf. Therefore, there is exactly one 
string content associated to each tree leaf. We will refer to 
those strings as texts. 
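
The model above can be illustrated with a small sketch. It is not the SXSI parser: the `Node` class, the helper names, and the use of Python's `xml.etree.ElementTree` are assumptions made only for this example, and the forest is wrapped in a hypothetical `<dummy>` element so that a standard parser accepts it.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    text: Optional[str] = None  # only set on '#' and '%' leaves


def build_model(elem: ET.Element) -> Node:
    """Convert one parsed XML element into the labeled-tree model."""
    node = Node(elem.tag)
    if elem.attrib:
        at = Node("@")  # single child that groups all attributes
        for name, value in elem.attrib.items():
            at.children.append(Node(name, [Node("%", text=value)]))
        node.children.append(at)
    if elem.text and elem.text.strip():
        node.children.append(Node("#", text=elem.text.strip()))
    for child in elem:
        node.children.append(build_model(child))
        if child.tail and child.tail.strip():
            node.children.append(Node("#", text=child.tail.strip()))
    return node


def texts(node: Node) -> List[str]:
    """Leaf strings in left-to-right order, i.e. by text identifier."""
    if node.text is not None:
        return [node.text]
    return [s for child in node.children for s in texts(child)]


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


COLLECTION = (
    '<part name="pen">Soon discontinued<color>blue</color>'
    '<stock>40</stock></part>'
    '<part name="rubber"><stock>30</stock></part>'
)
# Wrap the forest in a dummy element and relabel it as the dummy root.
root = build_model(ET.fromstring("<dummy>" + COLLECTION + "</dummy>"))
root.label = "&"
```

Here `texts(root)` lists the strings in text-identifier order, and every string hangs from exactly one `#` or `%` leaf.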

Let us call \(T\) the set of all the texts and \(u\) its total length 
measured in symbols, \(n\) the total number of tree nodes, \(\Sigma\) 
the alphabet of the strings and \(\sigma = |\Sigma|\), \(t\) the total number 
of different tag and attribute names, and \(d\) the number of 
texts (or tree leaves). The texts receive text identifiers which are 
consecutive numbers assigned in a left-to-right parsing of the 
data. In our implementation \(\Sigma\) is simply the set of byte values 
1 to 255, and 0 will act as a special terminator called $. This 
symbol occurs exactly once at the end of each text in \(T\) and 
is lexicographically smaller than the other symbols in \(\Sigma\). We 
can easily support multi-byte encodings such as Unicode. 



To connect tree nodes and texts, we define global identifiers, 
which give unique numbers to both internal and leaf nodes, in 
depth-first preorder. Figure 1 shows a toy collection (top left) 
and our model of it (top right), as well as its representation 
using our data structures (bottom), which serves as a running 
example for the rest of the paper. In the model, the tree is 
formed by the solid edges, whereas dotted edges display the 
connection with the set of texts. We created a dummy root 
labeled &, as well as dummy internal nodes #, @, and %. 
Note how the attributes are handled. There are 6 texts, which 
are associated to the tree leaves and receive consecutive text 
numbers (marked in italics at their right). Global identifiers 
are associated to each node and leaf (drawn at their left). 
The conversion between tag names and symbols, drawn within 
the bottom-left component, is used to translate queries and to 
recreate the XML data, and will not be further mentioned. 

Some notation and measures of compressibility follow, 
preceding a rough description of our space complexities. 
Logarithms will be in base 2. The empirical \(k\)-th order entropy [17] 
of a sequence \(S\), \(H_k(S) \le \log\sigma\), is a lower bound to the 
output size per symbol of any \(k\)-th order compressor applied 
to \(S\). We will build on self-indexes capable of handling text 
collections \(T\) of total length \(u\) within \(uH_k(T) + o(u\log\sigma)\) 
bits. On the other hand, representing an unlabeled tree of 
\(n\) nodes requires \(2n - o(n)\) bits, and several representations 
using \(2n + o(n)\) bits support many tree query and navigation 
operations in constant time. The labels require in principle 
another \(n\log t\) bits. Sequences \(S\) can be stored within their 
zero-order entropy, \(|S|H_0(S) + o(|S|\log\sigma)\) bits, so that any element 
\(S[i]\) can be accessed, and they can also answer queries 
\(\mathrm{rank}_c(S,i)\) (the number of \(c\)'s in \(S[1,i]\)) and \(\mathrm{select}_c(S,j)\) 
(the position of the \(j\)-th \(c\) in \(S\)). These are essential building 
blocks for more complex functionalities, as seen later. 
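
As a point of reference, the two queries can be written naively as follows. This is only a linear-time sketch of their semantics with 1-based positions; the function names and argument order are choices made for the example, not the compressed representation used by the index.

```python
def rank(S: str, c: str, i: int) -> int:
    """Number of occurrences of symbol c in S[1..i] (1-based, inclusive)."""
    return S[:i].count(c)


def select(S: str, c: str, j: int):
    """1-based position of the j-th occurrence of c in S, or None if absent."""
    pos = -1
    for _ in range(j):
        pos = S.find(c, pos + 1)
        if pos < 0:
            return None
    return pos + 1
```

The two operations are inverses in the sense that `rank(S, c, select(S, c, j))` equals `j` whenever the `j`-th occurrence exists.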

The final space requirement of our index will include: 

1) \(uH_k(T) + o(u\log\sigma)\) bits for representing the text 
collection \(T\) in self-indexed form. This supports the string 
searches of XPath and can (slowly) reproduce any text. 

2) \(2n + o(n)\) bits for representing the tree structure. This 
supports many navigational operations in constant time. 

3) \(d\log d + o(d\log d)\) bits for the string-to-text mapping, 
e.g., to determine to which text a string position belongs, 
or restricting string searches to some texts. 

4) Optionally, \(u\log\sigma\) or \(uH_k(T) + o(u\log\sigma)\) bits, plus 
\(O(d\log\frac{u}{d})\) bits, to achieve faster text extraction than in 1). 

5) \(4n\log t + O(n)\) bits to represent the tags in a way that 
they support very fast XPath searches. 

6) \(2n + o(n)\) bits for mapping between tree nodes and texts. 

As a practical yardstick: without the extra storage of texts 
(item 4) the memory consumption of our system is about 
the size of the original XML file (and, being a self-index, 
includes it!), and with the extra store the memory consumption 
is between 1 and 2 times the size of the original XML file. 

In Section III we describe our representation of the set 
of strings, including how to obtain text identifiers from text 
positions. This explains items 1, 3, and 4 above. Section IV 
describes our representation for the tree and the labels, and the 
way the correspondence between tree nodes and text identifiers 
works. This explains items 2, 5, and 6. Section V describes 
how we process XPath queries on top of these compact data 
structures. In Section VI we empirically compare our SXSI 
engine with the most relevant public engines we are aware of. 

XML data 

<part @name="pen"> 
Soon discontinued. 
<color>blue</color> 
<stock>40</stock> 
</part> 
<part @name="rubber"> 
<stock>30</stock> 
</part> 

Tree 

Par = ((((()))()(())(()))(((()))(()))) 
Tag = *p@n%/%/n/@#/#c#/#/cs#/#/s/pp@n%/%/n/@s#/#/s/p/* 
p = "part", n = "@name", c = "color", s = "stock" 

Text collection 

T = pen$Soon discontinued$blue$40$rubber$30$ 
F = $$$$$$0034 Sbbbcddeeeeiilnnnnoooprrstuuu 
L = T^bwt = nde0r043$$n$ub$se uupbtdbeooiocS$e$inrln 

Fig. 1. Our running example on representing an XML collection. 

III. Text Representation 

Text data is represented as a succinct full-text self-index 
[6] that is generally known as the FM-index [18]. The index 
supports efficient pattern matching that can be easily extended 
to support different XPath predicates. 

A. FM-Index and Backward Searching 

Given a string \(T\) of total length \(u\), from an alphabet of size 
\(\sigma\), the alphabet-friendly FM-index [19] requires \(uH_k(T) + 
o(u\log\sigma)\) bits of space. The index supports counting the 
number of occurrences of a pattern \(P\) in \(O(|P|\log\sigma)\) time. 
Locating the occurrences takes extra \(O(\log^{1+\epsilon} u \log\sigma)\) time 
per answer for any constant \(\epsilon > 0\). 

The FM-index is based on the Burrows-Wheeler transform 
of a string \(T\) [20]. Assume \(T\) ends with the special end-marker 
$. Let \(\mathcal{M}\) be a matrix whose rows are all the cyclic rotations 
of \(T\) in lexicographic order. Now the last column \(L\) of \(\mathcal{M}\) 
forms a permutation of \(T\) which is the BWT string \(L = T^{bwt}\). 
The matrix is only conceptual; the FM-index operates only on 
the actual \(T^{bwt}\) sequence. See Figure 1 (bottom right). 

The resulting permutation is reversible. The first column 
of \(\mathcal{M}\), denoted \(F\), contains all symbols of \(T\) in lexicographic 
order. There exists a simple last-to-first mapping from symbols 
in \(L\) to \(F\) [18]: Let \(C[c]\) be the total number of symbols 
in \(T\) that are lexicographically less than \(c\). Now the LF-mapping 
can be defined as \(LF(i) = C[L[i]] + \mathrm{rank}_{L[i]}(L,i)\). Symbols 
of \(T\) can be read in reverse order by starting from the end-marker 
location \(i\) and applying \(LF(i)\) recursively: we get 
\(T^{bwt}[i], T^{bwt}[LF(i)], T^{bwt}[LF(LF(i))]\) etc. and finally, after 
\(u\) steps, get the first symbol of \(T\). The values \(C[c]\) can be 
stored in a small array of \(\sigma\log u\) bits. The function \(\mathrm{rank}_c(L,i)\) 
can be computed in \(O(\log\sigma)\) time with a wavelet tree data 
structure requiring only \(uH_k(T) + o(u\log\sigma)\) bits [19], [21]. 
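
The transform and its inversion through the LF-mapping can be sketched on a single string as follows. This is a quadratic-time illustration only: it sorts explicit rotations instead of using a real construction algorithm, precomputes LF in a plain list instead of a wavelet tree, and assumes that `$` sorts before every other symbol of the example (true for lowercase ASCII letters).

```python
def bwt(T: str) -> str:
    """Last column of the sorted cyclic rotations of T (T must end with '$')."""
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(L: str) -> str:
    """Recover T from L by reading it backwards with the LF-mapping."""
    # C[c] = number of symbols in T that are smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)
    # LF[i] = C[L[i]] + rank_{L[i]}(L, i), with 1-based rows.
    seen, LF = {}, [0] * (len(L) + 1)
    for i, c in enumerate(L, start=1):
        seen[c] = seen.get(c, 0) + 1
        LF[i] = C[c] + seen[c]
    out, i = [], 1  # row 1 is the rotation that starts with '$'
    for _ in range(len(L) - 1):
        out.append(L[i - 1])  # symbol preceding the current suffix
        i = LF[i]
    return "".join(reversed(out)) + "$"
```

For `T = "abracadabra$"` the transform is `ard$rcaaaabb`, and `inverse_bwt` returns the original string.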

Pattern matching is supported via backward searching on 
the BWT [18]. Given a pattern \(P\) of length \(m\), the backward 
search starts by finding the range \([sp, ep]\) of rows in \(\mathcal{M}\) 
that have \(P[m]\) as a prefix, namely \(sp = C[P[m]] + 1\) and \(ep = 
C[P[m] + 1]\). At each step \(i \in \{m-1, m-2, \ldots, 1\}\) of the 
backward search, the range \([sp, ep]\) is updated to match all 
rows of \(\mathcal{M}\) that have \(P[i,m]\) as a prefix. The new range \([sp', ep']\) 
is given by \(sp' = C[P[i]] + \mathrm{rank}_{P[i]}(L, sp-1) + 1\) and 
\(ep' = C[P[i]] + \mathrm{rank}_{P[i]}(L, ep)\). Each step takes \(O(\log\sigma)\) 
time [19], and finally \(ep - sp + 1\) gives the number of times \(P\) 
occurs in \(T\). To find out the location of an occurrence, the text 
is traversed backwards (virtually, using LF on \(T^{bwt}\)) until a 
sampled position is found. If every \(l = O(\log^{1+\epsilon} u)\)-th position 
of \(T\) is sampled, locating takes \(O(l\log\sigma)\) time per occurrence. 
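
The counting part of the backward search can be sketched directly from the two update formulas. The example below hard-codes the BWT of the string `abracadabra$` and uses a naive linear-time `rank`; both are assumptions of this illustration, whereas the index itself answers `rank` with a wavelet tree.

```python
L = "ard$rcaaaabb"  # BWT of "abracadabra$"

# C[c] = number of symbols in the text that are smaller than c.
C, total = {}, 0
for c in sorted(set(L)):
    C[c] = total
    total += L.count(c)


def rank(c: str, i: int) -> int:
    """Occurrences of c in L[1..i] (1-based, inclusive)."""
    return L[:i].count(c)


def count(P: str) -> int:
    """Number of occurrences of the non-empty pattern P in the text."""
    c = P[-1]
    if c not in C:
        return 0
    sp, ep = C[c] + 1, C[c] + L.count(c)  # rows that start with P[m]
    for c in reversed(P[:-1]):            # steps m-1, ..., 1
        if c not in C:
            return 0
        sp = C[c] + rank(c, sp - 1) + 1
        ep = C[c] + rank(c, ep)
        if sp > ep:
            return 0
    return ep - sp + 1
```

Each loop iteration performs two `rank` calls, which is where the per-symbol cost of the search comes from.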

B. Text Collection and Queries 

The textual content of the XML data is stored as $-terminated 
strings so that each non-empty element corresponds 
to one text. Let \(T\) be the concatenated sequence of \(d\) texts. 
Since there are several $'s in \(T\), we fix a special ordering such 
that the end-marker of the \(i\)-th text will appear at \(F[i]\) in \(\mathcal{M}\). 
This generates a valid \(T^{bwt}\) of all the texts and makes it easy 
to extract the \(i\)-th text starting from its $-terminator. 

Now \(T^{bwt}\) contains all end-markers in some permuted order. 
This permutation is represented with a data structure, denoted 
Doc, that allows two-dimensional range searching [22] (see 
Figure 1). The mapping from a $-terminator in position \(T^{bwt}[i]\) to 
its entry in Doc can be calculated by \(\mathrm{rank}_{\$}(T^{bwt}, i)\). Given a 
range \([sp, ep]\) of \(T^{bwt}\) and a range of text identifiers \([x,y]\), 
Doc can be used to output identifiers of all $-terminators 
within the \([sp, ep] \times [x, y]\) range in \(O(\log d)\) time per answer. Doc 
requires \(d\log d\,(1 + o(1))\) bits of space. 

The basic pattern matching feature of the FM-index can be 
extended to support XPath functions such as starts-with, 
ends-with, contains, and operators \(=\), \(\le\), \(<\), \(\ge\), \(>\) for lexicographic 
ordering. Given a pattern and a range of text identifiers to be 
searched, these functions return all text identifiers that match 
the query within the range. In addition, existential (is there 
a match in the range?) and counting (how many matches 
in the range?) queries are supported. Time complexities are 
\(O(|P|\log\sigma)\) for the search phase, plus an extra cost for reporting: 

1) starts-with(\(P\), \([x,y]\)): The goal is to find texts in the \([x,y]\) 
range prefixed by the given pattern \(P\). After the normal 
backward search, the range \([sp, ep]\) in \(T^{bwt}\) contains the 
end-markers of all texts prefixed by \(P\). Now \([sp,ep] \times [x,y]\) can 
be mapped to Doc, and existential and counting queries can 
be answered in \(O(\log d)\) time. Matching text identifiers can 
be reported in \(O(\log d)\) time per identifier. 

2) ends-with(\(P\), \([x,y]\)): The given pattern is appended with 
$. Backward searching is localized to texts \([x, y]\) by choosing 
\(sp = x\) and \(ep = y\) as the starting interval. After the 
backward search, the resulting range \([sp, ep]\) contains all 
possible matches, thus, existential and counting queries can 
be answered in constant time. To find out text identifiers for 
each occurrence, the text must be traversed backwards to find a 
sampled position or the beginning of the current text. The cost is 
\(O(l\log\sigma)\) per answer. 

3) operator \(=\)(\(P\), \([x,y]\)): Texts that are equal to \(P\), and in 
range, can be found as follows. Do the backward search as in 
ends-with, then map to the $-terminators like in starts-with. 
Time complexities are the same as in starts-with. 

4) contains(\(P\), \([x,y]\)): To find texts that contain \(P\), we start 
with the normal backward search and finish like in ends-with. 
In this case there might be several occurrences inside one 
text, which have to be filtered. Thus, the time complexity is 
proportional to the total number of occurrences, \(O(l\log\sigma)\) 
for each. Existential and counting queries are as slow as 
reporting queries, but the \(O(|P|\log\sigma)\)-time counting of all the 
occurrences of \(P\) can still be useful for query optimization. 

5) operators \(\le\), \(<\), \(\ge\), \(>\): The operator \(\le\) matches texts 
that are lexicographically smaller than or equal to the given 
pattern. It can be solved like the starts-with query, but updating 
only the \(ep\) of each backward search step, while \(sp = 1\) stays 
constant. If at some point there are no occurrences of \(c\) within 
the prefix \(L[1, ep]\), we find those of smaller symbols in the 
range. This can be done by regarding the wavelet tree of the 
BWT as a range search data structure (as in previous work 
[22]). Other operators can be supported in similar fashion, 
and time complexities are the same as in starts-with. 
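
To make the expected answers of these queries concrete, the sketch below gives a brute-force reference semantics over the texts of the running example. It scans the texts directly and is not the index-based algorithm described above; the function names are hypothetical, and lexicographic order is taken to be Python's default string order.

```python
from typing import Callable, List

# Texts of the running example, indexed by text identifier 1..6.
TEXTS = ["pen", "Soon discontinued", "blue", "40", "rubber", "30"]


def matching_ids(pred: Callable[[str], bool], x: int, y: int) -> List[int]:
    """Identifiers in [x, y] (inclusive) whose text satisfies pred."""
    return [i for i in range(x, y + 1) if pred(TEXTS[i - 1])]


def starts_with(P: str, x: int, y: int) -> List[int]:
    return matching_ids(lambda s: s.startswith(P), x, y)


def ends_with(P: str, x: int, y: int) -> List[int]:
    return matching_ids(lambda s: s.endswith(P), x, y)


def equals(P: str, x: int, y: int) -> List[int]:
    return matching_ids(lambda s: s == P, x, y)


def contains(P: str, x: int, y: int) -> List[int]:
    # A text is reported once, however many occurrences it holds.
    return matching_ids(lambda s: P in s, x, y)


def less_or_equal(P: str, x: int, y: int) -> List[int]:
    return matching_ids(lambda s: s <= P, x, y)
```

Existential and counting variants follow by testing whether the returned list is non-empty or by taking its length. A reference of this kind is mainly useful for checking an indexed implementation on small inputs.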

The new XPath extension, XPath Full Text 1.0, suggests 
a wider selection of functionality for text searching. 
Implementation of these extensions requires regular expression and 
approximate searching functionalities, which can be supported 
within our index using the general backtracking framework 
[23]: The idea is to alter the backward search to branch 
recursively to different ranges \([sp', ep']\) representing the suffixes 
of the text prefixes (i.e. substrings). This is done simply by 
computing \(sp'_c = C[c] + \mathrm{rank}_c(L, sp-1) + 1\) and \(ep'_c = 
C[c] + \mathrm{rank}_c(L, ep)\) for all \(c \in \Sigma\) at each step and recursing 
on each \([sp'_c, ep'_c]\). Then the pattern (or regular expression) 
can be compared against all substrings of the texts, allowing 
to search for approximate occurrences [23]. The running time 
becomes exponential in the number of errors allowed, but 
different branch-and-bound techniques can be used to obtain 
practical running times [24], [25]. We omit further details here, 
since these extensions are out of the scope of this paper. 

C. Implementation details 

The FM-index can be built by adapting any BWT construc- 
tion algorithm. Linear time algorithms exist for the task, but 
their practical bottleneck is the peak memory consumption. 
Although there exist general time- and space-efficient con- 
struction algorithms, it turned out that our special case of 
text collection admits a tailored incremental BWT construction 
algorithm [26] (see the references and experimental compar- 
ison therein for previous work on BWT construction): The 
text collection is split into several smaller collections, and 
a temporary index is built for each of them separately. The 
temporary indexes are then merged, and finally converted into 
a static FM-index. 

The current implementation supports all the XPath text 
queries based on substring matching. We have also 
implemented approximate string matching and experimental 
support for regular expressions. To enable fast text extraction 
from the collection, we allow storing the texts in plain format 
in \(u\log\sigma\) bits, or in an enhanced LZ78-compressed format 
(derived from the LZ-index [27]) using \(uH_k(T) + o(u\log\sigma)\) 
bits. These secondary text representations are coupled with a 
delta-encoded bit vector storing starting positions of each text 
in \(T\). This requires \(O(d\log\frac{u}{d})\) more bits. 

IV. Tree Representation 
A. Data Representation 

The tree structure of an XML collection is represented 
by the following compact data structures, which provide 
navigation and indexed access to it. See Figure 1 (bottom left). 

1) Par: The balanced parentheses representation [28] of 
the tree structure. This is obtained by traversing the tree in 
depth-first-search (DFS) order, writing a "(" whenever we 
arrive at a node, and a ")" when we leave it (thus it is 
easily produced during the XML parsing). In this way, every 
node is represented by a pair of matching opening and closing 
parentheses. A tree node will be identified by the position 
of its opening parenthesis in Par (in other words, in this 
representation a node is just an integer index within Par). In 
particular, we will use the balanced parentheses implementation 
of Sadakane [13], which supports a very complete set 
of operations, including finding the \(i\)-th child of a node, in 
constant time. Overall Par uses \(2n + o(n)\) bits. This includes 
the space needed for constant-time binary ranks on Par, which 
are very efficient in practice. 

2) Tag: A sequence of the tag identifiers of each tree node, 
including an opening and a closing version of each tag, to mark 
the beginning and ending point of each node. These tags are 
numbers in \([1, 2t]\) and are aligned with Par so that the tag of 
node \(i\) is simply \(Tag[i]\). 

We will also need rank and select queries on Tag. Several 
sequence representations supporting these are known [29]. 
Given that Tag is not too critical in the overall space, but 
it is in time, we opt for a practical representation that favors 
speed over space. First, we store the tags in an array using 
\(\log(2t)\) bits per field, which gives constant time access to 
\(Tag[i]\). The rank and select queries over the sequence of 
tags are answered by a second structure. Consider the binary 
matrix \(M[1..2t][1..2n]\) such that entry \(M[i][j]\) is 1 if and only 
if \(Tag[j] = i\). We represent each row of the matrix using 
Okanohara and Sadakane's structure sarray [30]. Its space 
requirement for each row \(i\) is \(k_i\log\frac{2n}{k_i} + k_i(2 + o(1))\) bits, 
where \(k_i\) is the number of times symbol \(i\) appears in Tag. 
The total space of both structures adds up to \(2n\log(2t) + 
2nH_0(Tag) + n(2 + o(1)) \le 4n\log t + O(n)\) bits. They support 
access and select in \(O(1)\) time, and rank in \(O(\log n)\) time. 
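
Both sequences can be produced in a single depth-first traversal, as the following sketch shows for the tree of the running example. The nested `(label, children)` tuples, the `"/"` prefix for closing tags, and the `rows` dictionary (an uncompressed stand-in for the rows of the binary matrix) are assumptions of this illustration.

```python
def encode(node, par, tag):
    """Append the parentheses and the opening/closing tags of node's subtree."""
    label, children = node
    par.append("(")
    tag.append(label)          # opening version of the tag
    for child in children:
        encode(child, par, tag)
    par.append(")")
    tag.append("/" + label)    # closing version of the tag


def leaf(label):
    return (label, [])


# Model tree of the running example (p = part, n = name, c = color, s = stock).
part1 = ("p", [("@", [("n", [leaf("%")])]), leaf("#"),
               ("c", [leaf("#")]), ("s", [leaf("#")])])
part2 = ("p", [("@", [("n", [leaf("%")])]), ("s", [leaf("#")])])
root = ("&", [part1, part2])

par, tag = [], []
encode(root, par, tag)
PAR = "".join(par)

# rows[name] = sorted 1-based positions j with Tag[j] = name.
rows = {}
for j, name in enumerate(tag, start=1):
    rows.setdefault(name, []).append(j)
```

With the positions of each tag kept sorted, `select` is a direct lookup and `rank` a binary search, which mirrors the access pattern of the per-row structure described above.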

B. Tree Navigation 

We define the following operations over the tree structure, 
which will be useful to support XPath queries over the tree. 
Most of these operations are supported in constant time, except 
when a rank over Tag is involved. Let tag be a tag identifier. 

1) Basic Tree Operations: These are directly inherited from 
Sadakane's implementation [13]. We mention only the most 
important ones for this paper; \(x\) is a node (a position in Par). 

• Close(\(x\)): The closing parenthesis matching \(Par[x]\). If \(x\) 
is the root of a small subtree this takes a few local accesses to Par, 
otherwise a few non-local table accesses. 

• Preorder(\(x\)) = \(\mathrm{rank}_{(}(Par, x)\): Preorder number of \(x\). 

• SubtreeSize(\(x\)) = \((\mathrm{Close}(x) - x + 1)/2\): Number of nodes 
in the subtree rooted at \(x\). 

• IsAncestor(\(x, y\)) = \(x \le y \le \mathrm{Close}(x)\): Whether \(x\) is an 
ancestor of \(y\). 

• FirstChild(\(x\)) = \(x + 1\): First child of \(x\), if any. 

• NextSibling(\(x\)) = \(\mathrm{Close}(x) + 1\): Next sibling of \(x\), if any. 

• Parent(\(x\)): Parent of \(x\). Somewhat costlier than Close(\(x\)) 
in practice, because the answer is less likely to be near 
\(x\) in Par. 
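
These operations can be checked against the Par sequence of the running example with a naive sketch. Everything below scans the string linearly, so it illustrates only the definitions, not the constant-time structures of [13]; positions are 1-based and `None` stands for "no such node".

```python
PAR = "((((()))()(())(()))(((()))(())))"  # Par of the running example


def close(x):
    """Position of the parenthesis that closes the node opened at x."""
    depth = 0
    for i in range(x, len(PAR) + 1):
        depth += 1 if PAR[i - 1] == "(" else -1
        if depth == 0:
            return i


def preorder(x):
    return PAR[:x].count("(")


def subtree_size(x):
    return (close(x) - x + 1) // 2


def is_ancestor(x, y):
    return x <= y <= close(x)


def first_child(x):
    # Position x + 1 opens a child only if it holds "(".
    return x + 1 if PAR[x] == "(" else None


def next_sibling(x):
    c = close(x)
    return c + 1 if c < len(PAR) and PAR[c] == "(" else None


def parent(x):
    """Nearest unmatched '(' to the left of x, or None for the root."""
    depth = 0
    for i in range(x - 1, 0, -1):
        depth += 1 if PAR[i - 1] == ")" else -1
        if depth < 0:
            return i
    return None
```

Node 2 is the first `part` element and node 20 the second one; the root is node 1.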

2) Connecting to Tags: The following operations are 
essential for our fast XPath evaluation. 

• SubtreeTags(\(x\), tag): Returns the number of occurrences 
of tag within the subtree rooted at node \(x\). This is 
\(\mathrm{rank}_{tag}(Tag, \mathrm{Close}(x)) - \mathrm{rank}_{tag}(Tag, x - 1)\). 

• Tag(\(x\)): Gives the tag identifier of node \(x\). In our 
representation this is just \(Tag[x]\). 

• TaggedDesc(\(x\), tag): The first node labeled tag with 
preorder larger than that of node \(x\), and within the subtree 
rooted at \(x\). This is \(\mathrm{select}_{tag}(Tag, \mathrm{rank}_{tag}(Tag, x) + 1)\) 
if it is \(\le \mathrm{Close}(x)\), and undefined otherwise. 

• TaggedPrec(\(x\), tag): The last node labeled tag with 
preorder smaller than that of node \(x\), and not an ancestor of 
\(x\). Let \(r = \mathrm{rank}_{tag}(Tag, x - 1)\). If \(\mathrm{select}_{tag}(Tag, r)\) 
is not an ancestor of node \(x\), then we stop. Otherwise, 
we set \(r = r - 1\) and iterate. 

• TaggedFoll(\(x\), tag): The first node labeled tag with 
preorder larger than that of \(x\), and not in the subtree of \(x\). 
This is \(\mathrm{select}_{tag}(Tag, \mathrm{rank}_{tag}(Tag, \mathrm{Close}(x)) + 1)\). 
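
The rank/select formulas for SubtreeTags, TaggedDesc, and TaggedFoll can be tried on the running example as follows. The sketch uses naive linear-time `rank` and `select` over a Python list and string names instead of numeric tag identifiers; these choices, and the `"/"` prefix for closing tags, are assumptions of the example.

```python
PAR = "((((()))()(())(()))(((()))(())))"
TAG = ("& p @ n % /% /n /@ # /# c # /# /c s # /# /s /p "
       "p @ n % /% /n /@ s # /# /s /p /&").split()  # aligned with PAR


def close(x):
    depth = 0
    for i in range(x, len(PAR) + 1):
        depth += 1 if PAR[i - 1] == "(" else -1
        if depth == 0:
            return i


def rank(tag, i):
    """Occurrences of tag in TAG[1..i] (1-based, inclusive)."""
    return TAG[:i].count(tag)


def select(tag, j):
    """Position of the j-th occurrence of tag, or None."""
    seen = 0
    for pos, name in enumerate(TAG, start=1):
        seen += name == tag
        if name == tag and seen == j:
            return pos
    return None


def subtree_tags(x, tag):
    return rank(tag, close(x)) - rank(tag, x - 1)


def tagged_desc(x, tag):
    p = select(tag, rank(tag, x) + 1)
    return p if p is not None and p <= close(x) else None


def tagged_foll(x, tag):
    return select(tag, rank(tag, close(x)) + 1)
```

For instance, the first `stock` node below the root is at position 15, and the first one following the subtree of node 2 is at position 27.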

3) Connecting the Text and the Tree: Conversion between 
text numbers, tree nodes, and global identifiers, is easily 
carried out by using Par and a bitmap \(B\) of \(2n\) bits that marks 
the opening parentheses of tree leaves, plus \(o(n)\) extra bits to 
support rank/select queries. Bitmap \(B\) enables the computation 
of the following operations: 

• LeafNumber(\(x\)): Gives the number of leaves up to \(x\) in 
Par. This is \(\mathrm{rank}_1(B, x)\). 

• TextIds(\(x\)): Gives the range of text identifiers that 
descend from node \(x\). This is simply \([\mathrm{LeafNumber}(x - 1) + 
1, \mathrm{LeafNumber}(\mathrm{Close}(x))]\). 

• XMLIdText(\(d\)): Gives the global identifier for the text 
with identifier \(d\). This is \(\mathrm{Preorder}(\mathrm{select}_1(B, d))\). 

• XMLIdNode(\(x\)): Gives the global identifier for a tree 
node \(x\). This is just \(\mathrm{Preorder}(x)\). 
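
A naive sketch of these conversions on the running example follows. The bitmap is a plain Python list and rank/select are linear scans, which is an assumption of this illustration rather than the \(o(n)\)-bit structures used in practice.

```python
PAR = "((((()))()(())(()))(((()))(())))"

# B[i - 1] = 1 iff position i holds the opening parenthesis of a leaf, i.e. "()".
B = [1 if PAR[i:i + 2] == "()" else 0 for i in range(len(PAR))]


def close(x):
    depth = 0
    for i in range(x, len(PAR) + 1):
        depth += 1 if PAR[i - 1] == "(" else -1
        if depth == 0:
            return i


def preorder(x):
    return PAR[:x].count("(")


def leaf_number(x):
    """rank_1(B, x): leaves whose opening parenthesis is at a position <= x."""
    return sum(B[:x])


def text_ids(x):
    """Inclusive range of text identifiers below node x."""
    return (leaf_number(x - 1) + 1, leaf_number(close(x)))


def xml_id_text(d):
    """Global identifier of the d-th text: Preorder(select_1(B, d))."""
    seen = 0
    for pos, bit in enumerate(B, start=1):
        seen += bit
        if bit and seen == d:
            return preorder(pos)
    return None
```

The first `part` element (node 2) covers texts 1 to 4 and the second one (node 20) covers texts 5 and 6, matching the order of the strings in the text collection.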

C. Displaying Contents 

Given a node \(x\), we want to recreate its text (XML) content, 
that is, return the string. We traverse the structure starting from 
\(Par[x]\), retrieving the tag names and the text contents, from the 
text identifiers. The time is \(O(\log\sigma)\) per text symbol (or \(O(1)\) 
if we use the redundant text storage described in Section III-C) 
and \(O(1)\) per tag. 

• GetText (d): Generates the text with identifier d. 

• GetSubtree (x): Generates the subtree at node x. 

D. Handling Dynamic Sets 

During XPath evaluation we need to handle sets of interme- 
diate results, that is, global identifiers. Due to the mechanics 
of the evaluation, we need to start from an empty set and later 
carry out two types of operations: 

• Insert a new identifier to the result. 

• Remove a range of identifiers (actually, a subtree). 

To remove a range faster than by brute force, we use a data 
structure of \(2n - 1\) bits representing a perfect binary tree over 
the interval of global identifiers, so that leaves of this binary 
tree represent individual positions and internal nodes represent ranges 
of positions (i.e., the union of their child ranges). A bit mark 
at each such internal node can be set to zero to implicitly set 
all its range to zero. A position is in the set if and only if all 
the bits on the path from the root to it are nonzero. Thus one can easily 
insert elements in \(O(\log n)\) time, and remove ranges within 
the same time, as any range can be covered with \(O(\log n)\) 
binary tree nodes. 
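
One way to realize such a structure is sketched below. The class name, the heap layout of the bits, the rounding of the universe to a power of two, and the rule of clearing both children whenever a zero node is switched back on during an insertion are assumptions of this sketch; the text above does not spell out these details.

```python
class RangeSet:
    """Set over positions 0..n-1 with O(log n) insert and range removal."""

    def __init__(self, n: int):
        size = 1
        while size < n:
            size *= 2
        self.size = size
        # Heap layout: node 1 is the root, leaves are size..2*size-1.
        self.bit = [0] * (2 * size)

    def insert(self, pos: int) -> None:
        v, lo, hi = 1, 0, self.size  # node v covers [lo, hi)
        while True:
            if not self.bit[v]:
                self.bit[v] = 1
                if v < self.size:
                    # The range was empty: keep both halves empty for now,
                    # which also hides any stale bits further down.
                    self.bit[2 * v] = self.bit[2 * v + 1] = 0
            if v >= self.size:
                return
            mid = (lo + hi) // 2
            if pos < mid:
                v, hi = 2 * v, mid
            else:
                v, lo = 2 * v + 1, mid

    def remove_range(self, a: int, b: int) -> None:
        """Remove every position in the half-open range [a, b)."""
        self._remove(1, 0, self.size, a, b)

    def _remove(self, v, lo, hi, a, b):
        if a >= hi or b <= lo or not self.bit[v]:
            return
        if a <= lo and hi <= b:
            self.bit[v] = 0  # one bit clears the whole range
            return
        mid = (lo + hi) // 2
        self._remove(2 * v, lo, mid, a, b)
        self._remove(2 * v + 1, mid, hi, a, b)

    def __contains__(self, pos: int) -> bool:
        v = pos + self.size
        while v >= 1:
            if not self.bit[v]:
                return False
            v //= 2
        return True
```

Removing a subtree then amounts to one `remove_range` call over the interval of global identifiers it spans, without touching the individual positions.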



V. XPath Queries 

The aim is to support a practical subset of XPath, while 
being able to guarantee efficient evaluation based on the data 
structures described before. As a first step we will support 
the "Core XPath" subset [16] of XPath 1.0. It supports all 
12 navigational axes, all node tests, and filters with Boolean 
operations (and, or, not). In our prototype implementation, all 
axes have been implemented, but only the forward fragment 
(consisting of self, child, descendant, and following-sibling) 
has been fully optimized. We therefore focus here only on 
these axes. A node test (non-terminal NodeTest below) 
is either the wildcard ('*'), a tag name, or a node type test, 
i.e., one of text() or node(); the node type tests comment() 
and processing-instruction() are not supported in our current 
prototype. Of course, we support all text predicates of XPath 
1.0, i.e., the =, contains, and starts-with predicates. Here is 
an EBNF for Core XPath. 



Core         := LocationPath | '/' LocationPath 
LocationPath := LocationStep ('/' LocationStep)* 
LocationStep := Axis '::' NodeTest 
              | Axis '::' NodeTest '[' Pred ']' 
Pred         := Pred 'and' Pred | Pred 'or' Pred 
              | 'not' '(' Pred ')' | Core | '(' Pred ')' 



A data value is the value of an attribute or the content of 
a text node. Here, all data values are considered as strings. 
If an XPath expression selects only data values, i.e., its final 
location step is the attribute axis or a text() test, then we call it 
a value expression. Our XPath fragment ("Core+") consists of 
Core XPath plus the following data value comparisons which 
may appear inside filters (that is, may be generated by the 
nonterminal Pred above). Let \(w\) be a string and \(p\) a value 
expression; if \(p\) equals . (dot) or self and the XPath expression 
to the left of the filter is a value expression, then \(p\) is a value 
expression as well. 

• \(p = w\) (equality): tests if a string selected by \(p\) is equal 
to \(w\). 

• contains(\(p\), \(w\)): tests if the string \(w\) is contained in a string 
selected by \(p\). 

• starts-with(\(p\), \(w\)): tests if the string \(w\) is a prefix of a 
string selected by \(p\). 

A. Tree Automata Representation 

It is well-known that Core XPath can be evaluated using tree 
automata; see, e.g., [31]. Here we use alternating tree automata 
(as in [32]). Such automata work with Boolean formulas over 
states, which must become satisfied for a transition to fire. This 
allows a much more compact representation of queries 
than ordinary tree automata (without formulas). Our 
tree automata work over a binary tree view of the XML tree 
where the left child is the first child of the XML node and the 
right child is the next sibling of the XML node. 

Definition 5.1 (Non-deterministic marking automaton): 
An automaton \(\mathcal{A}\) is a tuple \((\mathcal{L}, Q, \mathcal{I}, \delta)\), where \(\mathcal{L}\) is the infinite 
set of all possible tree labels, \(Q\) is the finite set of states, 
\(\mathcal{I} \subseteq Q\) is the set of initial states, and \(\delta : Q \times 2^{\mathcal{L}} \to \mathcal{F}\) is the 
transition function, where \(\mathcal{F}\) is a set of Boolean formulas. A 
Boolean formula \(\phi\) is generated by the following EBNF. 

\(\phi\) := \(\top \mid \bot \mid \phi \lor \phi \mid \phi \land \phi \mid a \mid p\)   (formula) 
\(a\) := \(\downarrow_1 q \mid \downarrow_2 q\)   (atom) 

where \(p \in P\) is a built-in predicate and \(q\) is a state. We call 
\(\mathcal{F}\) the set of well-formed formulas. 

Fig. 2. Inference rules defining the evaluation of a formula 

Definition 5.2 (Evaluation of a formula): 
Given an automaton \(\mathcal{A}\) and an input tree \(t\), the 
evaluation of a formula is given by the judgement 

\[\mathcal{R}_1, \mathcal{R}_2, t' \vdash_{\mathcal{A}} \phi = (b, R)\] 

where \(\mathcal{R}_1\) and \(\mathcal{R}_2\) are mappings from states to sets of subtrees 
of \(t\), \(t'\) is a subtree of \(t\), \(\phi\) is a formula, \(b \in \{\top, \bot\}\) and \(R\) is a 
set of subtrees of \(t\). We define the semantics of this judgment 
by means of inference rules, given in Figure 2. 

These rules are straightforward and combine the 
rules for a classical alternating automaton with the rules of 
a marking automaton. Rules (or) and (and) implement the 
Boolean connectives of the formula and collect the markings 
found in their true sub-formulas. Rules (left) and (right) 
(written as a rule schema for concision) evaluate to true if the 
state $q$ is in the corresponding set. Intuitively, $\mathcal{R}_1$ (resp. $\mathcal{R}_2$) is 
the set of states accepted in the left (resp. right) subtree of the 
input tree. Rule (pred) supposes the existence of an evaluation 
function for built-in predicates. Among the latter, we suppose 
the existence of a special predicate, mark, which evaluates to 
$\top$ and returns the singleton set containing the current subtree. 
We can now give the semantics of an automaton by means 
of a run function. 
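
As an illustration of Definition 5.2, the following Python sketch evaluates a formula against two mappings. The tuple encoding of formulas (`("or", f, g)`, `("down", i, q)` for the atom $\downarrow_i q$, `("mark",)`) and the name `eval_formula` are assumptions made for this example; they are not taken from the SXSI implementation, and `mark` is the only built-in predicate handled.

```python
def eval_formula(phi, R1, R2, t):
    """Return (b, R): the truth value of phi and the set of marked subtrees.

    R1 and R2 map states to sets of subtrees (results for the left and
    right subtree); t stands for the current subtree.
    """
    kind = phi[0]
    if kind == "true":
        return True, frozenset()
    if kind == "false":
        return False, frozenset()
    if kind == "mark":
        # Built-in predicate: always true, marks the current subtree.
        return True, frozenset([t])
    if kind == "down":
        # ("down", i, q) is true iff q is accepted in the i-th subtree.
        _, i, q = phi
        Ri = R1 if i == 1 else R2
        return (True, frozenset(Ri[q])) if q in Ri else (False, frozenset())
    if kind in ("or", "and"):
        b1, S1 = eval_formula(phi[1], R1, R2, t)
        b2, S2 = eval_formula(phi[2], R1, R2, t)
        b = (b1 or b2) if kind == "or" else (b1 and b2)
        # A false sub-formula always carries an empty set, so the union
        # keeps exactly the markings of the true sub-formulas.
        return (True, S1 | S2) if b else (False, frozenset())
    raise ValueError(f"unknown formula: {phi!r}")


# Example: q1 is accepted on the left with one marked subtree, nothing on the right.
R1, R2 = {"q1": {"k1"}}, {}
left, right = ("down", 1, "q1"), ("down", 2, "q2")
or_result = eval_formula(("or", left, right), R1, R2, "t")
and_result = eval_formula(("and", left, right), R1, R2, "t")
mark_result = eval_formula(("and", ("mark",), left), R1, R2, "t")
```

In this example the disjunction is true and carries the marking of its true branch only, the conjunction is false and carries nothing, and the last formula adds the current subtree to the markings inherited from the left.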

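Algorithm 5.1 (Top-down run function): 
Input: $\mathcal{A} = (\mathcal{L}, \mathcal{Q}, \mathcal{I}, \delta)$, $t$, $r$   Output: $\mathcal{R}$ 

where $\mathcal{A}$ is the automaton, $t$ the input tree, $r$ a set of states and $\mathcal{R}$ 
a mapping from states of $\mathcal{Q}$ to sets of subtrees of $t$ such that 
$\mathrm{dom}(\mathcal{R}) \subseteq r$. 

 1 function top_down_run $\mathcal{A}$ $t$ $r$ = 
 2   if $t$ is the empty tree then return $\emptyset$ 
 3   else 
 4     let trans = $\{(q, \ell) \to \phi \mid q \in r \text{ and } \mathrm{Tag}(t) \in \ell\}$ 
 5     in 
 6     let $r_i$ = $\{q \mid \downarrow_i q \in \phi, \forall \phi \in \mathit{trans}\}$ 
 7     in 
 8     let $\mathcal{R}_1$ = top_down_run $\mathcal{A}$ FirstChild($t$) $r_1$ 
 9     and $\mathcal{R}_2$ = top_down_run $\mathcal{A}$ NextSibling($t$) $r_2$ 
10     in return 
11       $\{ q \mapsto R \mid \mathcal{R}_1, \mathcal{R}_2, t \vdash_{\mathcal{A}} \phi = (\top, R),$ 
12       $\forall (q, \ell \to \phi) \in \mathit{trans} \}$ 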



The algorithm is straightforward. Although we called this 
function top_down_run, it is clear that it corresponds to the 
classical notion of bottom-up run for an automaton. Indeed, 
even though this function is called on the root node with the 
initial set of states, the sequence of recursive calls first reaches 
the leaves of the tree and starts evaluating the transitions while 
"returning" from a recursive call, hence when moving upward 
in the tree. This algorithm works in a very general setting. 
Considering any subtree $t$ of our input tree, let $\mathcal{R}$ be the result 
of top_down_run($\mathcal{A}$, $t$, $\mathcal{Q}$). Then $\mathrm{dom}(\mathcal{R})$ is the set of states 
which accept $t$ and, $\forall q \in \mathrm{dom}(\mathcal{R})$, $\mathcal{R}(q)$ is the set of subtrees 
of $t$ marked during a run starting from $q$ on the tree $t$. It 
is easy to see that the evaluation of top_down_run($\mathcal{A}$, $t$, $r$) 
takes time $O(|\mathcal{A}| \times |t|)$, provided that the operations $\ovee$, $\owedge$ and 
eval_pred can be evaluated in constant time. 

B. From XPath to Automata 

The translation from XPath to alternating automata is 
simple and can be done in one pass through the parse tree 
of the XPath expression. Roughly speaking, the resulting 
automaton is "isomorphic" to the original query (and 
has approximately the same size). All the optimizations 
discussed later are on-the-fly algorithms; for instance, we 
only determinize the automaton during its run on the 
input tree. Here, we only give an example of a query 
and its corresponding automaton. Consider the query 
/descendant::listitem/descendant::keyword. 
The corresponding automaton is $\mathcal{A} = (\mathcal{L}, \{q_0, q_1\}, \{q_0\}, \delta)$ 
where $\delta$ contains the following transitions: 

$$\begin{array}{rlrl}
1 & q_0, \{\mathtt{listitem}\} \to \downarrow_1 q_1 & \quad 4 & q_1, \{\mathtt{keyword}\} \to \mathit{mark} \\
2 & q_0, \mathcal{L} - \{@, \#\} \to \downarrow_1 q_0 & \quad 5 & q_1, \mathcal{L} - \{@, \#\} \to \downarrow_1 q_1 \\
3 & q_0, \mathcal{L} \to \downarrow_2 q_0 & \quad 6 & q_1, \mathcal{L} \to \downarrow_2 q_1
\end{array}$$

The automaton starts in state $\{q_0\}$ and traverses the tree until it 
finds a subtree labeled listitem. At such a subtree, the 
automaton changes to state $\{q_0, q_1\}$ on the left subtree (because 
it is non-deterministic and two transitions fire), looking for a 
tag keyword or possibly another tag listitem, and it will 
recurse on the right subtree in state $\{q_0\}$ again. Transitions 
2 and 5 make sure that, according to the semantics of the 
descendant axis, only element nodes (and not text or attributes) 
are considered. If, in state $\{q_0, q_1\}$, it finds a node labeled 
keyword then this node is marked as a result node. 
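
To make the example concrete, the sketch below encodes the six transitions above and runs a direct Python transcription of Algorithm 5.1 on a small binary-view tree. The `Node` class, the label tests and the tuple encoding of formulas are assumptions of this example. Two further simplifications are made: when several transitions of the same state fire, their markings are unioned, and result sets are plain Python sets rather than the sequences discussed later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity-based hashing, so nodes can be stored in sets
class Node:
    tag: str
    first_child: Optional["Node"] = None   # left child in the binary view
    next_sibling: Optional["Node"] = None  # right child in the binary view


def only(*tags):
    return lambda tag: tag in tags


def element(tag):    # L - {@, #}: any label except attribute and text markers
    return tag not in ("@", "#")


def any_label(tag):  # L: every label
    return True


# Transitions 1-6 of the example; a formula is ("down", i, q) or ("mark",).
DELTA = [
    ("q0", only("listitem"), ("down", 1, "q1")),  # 1
    ("q0", element,          ("down", 1, "q0")),  # 2
    ("q0", any_label,        ("down", 2, "q0")),  # 3
    ("q1", only("keyword"),  ("mark",)),          # 4
    ("q1", element,          ("down", 1, "q1")),  # 5
    ("q1", any_label,        ("down", 2, "q1")),  # 6
]


def top_down_run(t, r):
    """Return a mapping from accepting states in r to sets of marked nodes."""
    if t is None:                      # empty tree: no state accepts
        return {}
    trans = [(q, phi) for q, test, phi in DELTA if q in r and test(t.tag)]
    r1 = {phi[2] for _, phi in trans if phi[:2] == ("down", 1)}
    r2 = {phi[2] for _, phi in trans if phi[:2] == ("down", 2)}
    R1 = top_down_run(t.first_child, r1)
    R2 = top_down_run(t.next_sibling, r2)
    R = {}
    for q, phi in trans:
        if phi == ("mark",):
            R.setdefault(q, set()).add(t)
        else:
            Ri = R1 if phi[1] == 1 else R2
            if phi[2] in Ri:           # atom is true: q accepts and inherits the marks
                R.setdefault(q, set()).update(Ri[phi[2]])
    return R


# <site><listitem><keyword/><listitem><keyword/></listitem></listitem><keyword/></site>
k2 = Node("keyword")
inner = Node("listitem", first_child=k2)
k1 = Node("keyword", next_sibling=inner)
k3 = Node("keyword")                   # not below any listitem
outer = Node("listitem", first_child=k1, next_sibling=k3)
root = Node("site", first_child=outer)

result = top_down_run(root, {"q0"}).get("q0", set())
```

On this tree the run marks the two keyword nodes that lie below a listitem and leaves out the third one, which has no listitem ancestor.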

C. General Optimizations, On-the-Fly Memoization 

In Algorithm 5.1 the most expensive operation is in Line 11, 
which is evaluating the set of possible transitions and accumulating 
the mappings. First, note that only the states outside 
of filters actually accumulate nodes. All other states always 
yield empty bindings. Thus we can split the set of states into 
marking and regular states. This reduces the number of $\ovee$ and 
$\owedge$ operations on result sets. Note also that given a transition 
$q_i, \mathcal{L} \to \downarrow_1 q_j \lor \downarrow_2 q_k$ where $q_i$, $q_j$ and $q_k$ are marking states, 
all nodes accumulated in $q_j$ are subtrees of the left subtree 
of the input tree. Likewise, all the nodes accumulated in $q_k$ 
are subtrees of the right subtree of the input tree. Thus both 
sets of nodes are disjoint. Therefore, we do not need to keep 
sorted sets of nodes but only need sequences which support 
$O(1)$ concatenation. Thus, computing the union of two result 
sets $R_j$ and $R_k$ can be done in constant time and therefore $\ovee$ 
and $\owedge$ can be implemented in constant time. 
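
One simple way to obtain sequences with constant-time concatenation is a concatenation tree, sketched below in Python. This is only an illustration under the assumption that the two operands are disjoint; the names `single`, `concat` and `to_list` are hypothetical, and the source does not say which sequence structure SXSI actually uses.

```python
def single(x):
    """A one-element sequence."""
    return ("leaf", x)


def concat(a, b):
    """Concatenate two sequences in O(1); None is the empty sequence."""
    if a is None:
        return b
    if b is None:
        return a
    return ("cat", a, b)   # no copying: just remember both halves


def to_list(s):
    """Flatten a sequence in order (linear time, done once at the end)."""
    out, stack = [], [s]
    while stack:
        n = stack.pop()
        if n is None:
            continue
        if n[0] == "leaf":
            out.append(n[1])
        else:
            stack.append(n[2])   # right half is visited after the left half
            stack.append(n[1])
    return out


left_marks = concat(single("k1"), single("k2"))   # nodes marked in the left subtree
right_marks = single("k3")                        # nodes marked in the right subtree
merged = concat(left_marks, right_marks)
```

Because the left and right markings are disjoint and the left subtree precedes the right one, concatenation preserves document order without any sorting or duplicate elimination.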

Another important practical improvement exploits the fact 
that the automata are very repetitive. For instance if an XPath 
query does not contain any data value predicate (such as 
contains) then its evaluation only depends on the tags of 
the input tree. We can use this to our advantage to memoize 
the results based on the tag of the input tree and the set r. 
Indeed, the set r and the tag of the input tree t uniquely define 
the set trans of possible transitions. So instead of computing 
such a set at every step, we can cache it in a hash-table 
where the key is the pair (Tag(t),r); this corresponds to an 
on-the-fly determinization of automata. Of course computing 
the hash of such a key must be fast for the operation to be 
beneficial. Labels are not a problem since they are internally 
represented as integers. Sets however are trickier. We use the 
technique described in [33]. Basically, we represent sets of 
integers (which can be used for sets of states, sets of tags, 
sets of transitions, ...) as hash-consed Patricia trees, which 
support $O(1)$ hashing and $O(1)$ equality checking. In practice 
the number of different values for the input set $r$ is very small. 
We can further improve the running time by using an array of 
hash-tables instead of a hash-table indexed by (label($t$), $r$). 

We can apply a similar technique for the other expensive 
operation, that is, the evaluation of the set of formulas. 
This operation can be split in two parts: the evaluation of 
the formulas and the propagation of the result sets for the 
corresponding marking states. Again, if the formulas do not 
contain data value predicates, then their value only depends 
on the states present in $\mathcal{R}_1$ and $\mathcal{R}_2$, the results of the recursive 
calls. Using the same technique, we can memoize the results 
in a hash table indexed by the key $(\mathrm{dom}(\mathcal{R}_1), \mathrm{dom}(\mathcal{R}_2))$. This 
hash table contains a pair formed by $\mathrm{dom}(\mathcal{R})$, the states in the result 
mapping, and a sequence of assignments to evaluate, of the 
form $[q_i := \mathrm{concat}(q_j, q_k), \ldots]$, which represents the result sets that need 
to be propagated between the different marking states. 

Another optimization is for the result set associated with the initial state 
of the automaton, which is the answer to the query. This result 
set is "final" in the sense that anything that was propagated 
up to it will be in the result set. We can exploit this fact and 
use a more compact data structure for this set of results (for 
instance the one described in Section IV-D). Thus we can trade 
time complexity (since insertion is $O(\log n)$ in this structure) 
for space. Using this scheme, we are able to answer queries 
containing billions of results using little memory. 
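
The transition cache keyed by (Tag($t$), $r$) described above can be sketched as follows. This is a minimal Python illustration, not the SXSI code: `frozenset` stands in for the hash-consed Patricia trees of [33] (Python caches a frozenset's hash after computing it once, which is not the same as $O(1)$ hashing), the formulas are kept as opaque strings, and the names `transitions`, `_cache` and `stats` are hypothetical.

```python
# (state, label test, formula) triples; formulas are opaque strings here.
DELTA = [
    ("q0", lambda tag: tag == "listitem",       "down1 q1"),
    ("q0", lambda tag: tag not in ("@", "#"),   "down1 q0"),
    ("q0", lambda tag: True,                    "down2 q0"),
    ("q1", lambda tag: tag == "keyword",        "mark"),
    ("q1", lambda tag: tag not in ("@", "#"),   "down1 q1"),
    ("q1", lambda tag: True,                    "down2 q1"),
]

_cache = {}
stats = {"hits": 0, "misses": 0}


def transitions(tag, r):
    """Return the transitions that can fire for label `tag` in state set `r`.

    The result depends only on (tag, r), so it is computed once per key;
    this amounts to determinizing the automaton on the fly.
    """
    key = (tag, frozenset(r))
    if key in _cache:
        stats["hits"] += 1
        return _cache[key]
    stats["misses"] += 1
    trans = tuple((q, phi) for q, test, phi in DELTA if q in r and test(tag))
    _cache[key] = trans
    return trans


first = transitions("listitem", {"q0"})
second = transitions("listitem", {"q0"})   # served from the cache
```

The second call returns the very same tuple without scanning the transition table again, which is the saving the memoization is after when the same (label, state set) pair recurs at many nodes.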

D. Leveraging the Speed of the Low-Level Interface 

Conventionally, the run of a tree automaton visits every 
node of the input tree. For highly efficient XPath evaluation, 
this is not good enough and we must find ways to restrict 
the run to the nodes that are "relevant" for the query (this 
is precisely what is also done through "partitioning and 
pruning" in the staircase join [34]). Consider the query 
/descendant::listitem/descendant::keyword 
from before. Clearly, we only care about listitem and keyword 
nodes for this query, and how they are situated with respect 
to each other. This is precisely the information that is 
provided through the TaggedDesc and TaggedFoll functions 
of the tree representation. These functions allow us to have 
a "contracted" view of the tree, restricted to nodes with 
certain labels of interest (but preserving the overall tree 
structure). For instance, to solve the above query we can call 
TaggedDesc(Root, listitem), which selects the first listitem node 
x. Now we can recursively apply TaggedDesc(x, keyword) 
and TaggedFoll(y, keyword), where y is a keyword node found so far, 
in order to select all keyword descendants of x. We do this 
optimization of "jumping run" 
based on the automaton: for a given set of states of the 
automaton we compute the set of relevant transitions which 
cause a state change. For instance, in the automaton for the 
above query, which is shown in Section V-B, only transitions 1 
and 4 are relevant. Thus, in state $\{q_0\}$ the automaton can use 
TaggedDesc to jump to listitem nodes, and in state $\{q_0, q_1\}$ 
it can jump to listitem or keyword nodes. 
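
The jumping scheme can be illustrated with the following Python sketch. The tree is stored as a preorder array of tags together with, for each node, the index just past the end of its subtree. Here `tagged_desc` and `tagged_foll` are naive linear scans that only mimic the behaviour assumed for TaggedDesc and TaggedFoll (first node with the tag inside the subtree, and first node with the tag after the subtree, respectively); the real operations are answered by the compressed tree representation.

```python
# Preorder tags and subtree ends (exclusive) for:
# site( listitem( keyword, listitem( keyword ) ), keyword, listitem( #, keyword ) )
TAGS = ["site", "listitem", "keyword", "listitem", "keyword",
        "keyword", "listitem", "#", "keyword"]
END = [9, 5, 3, 5, 5, 6, 9, 8, 9]


def tagged_desc(x, tag):
    """First node labeled `tag` inside the subtree of x, or None."""
    for i in range(x + 1, END[x]):
        if TAGS[i] == tag:
            return i
    return None


def tagged_foll(x, tag):
    """First node labeled `tag` after the subtree of x, or None."""
    for i in range(END[x], len(TAGS)):
        if TAGS[i] == tag:
            return i
    return None


def keyword_descendants(x):
    """All keyword descendants of x, visiting only keyword nodes."""
    y = tagged_desc(x, "keyword")
    while y is not None and y < END[x]:
        yield y
        nxt = tagged_desc(y, "keyword")       # keyword nested inside y
        if nxt is None:
            nxt = tagged_foll(y, "keyword")   # next keyword after y's subtree
        y = nxt


def listitem_keyword(root=0):
    """Answer /descendant::listitem/descendant::keyword by jumping."""
    out = []
    x = tagged_desc(root, "listitem")
    while x is not None:
        out.extend(keyword_descendants(x))
        # Nested listitems add no new results, so skip the subtree of x.
        x = tagged_foll(x, "listitem")
    return out


answer = listitem_keyword()
```

Only listitem and keyword nodes are ever inspected by the query logic; the keyword at position 5, which has no listitem ancestor, is never reported.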

Bottom-up run: While the previous technique works 
well for tree-based queries, it still remains slow for 
value-based queries. For instance, consider the query 
//listitem//keyword[contains(., "Unique")]. 
The text interface described in Section III can answer 
the string query very efficiently, returning the set of text 
nodes matching this contains query. It is also able to count 
globally the number of such results. If this number is low, 
and in particular smaller than the number of listitem 
or keyword tags in the document (which can also be 
determined efficiently through the tree structure interface), 
then it would be faster to take these text nodes as starting 
point for query evaluation and test if their path to the root 
matches the XPath expression //listitem//keyword. 
This scheme is particularly useful for text oriented queries 
with low selectivity. However, it also applies to tree-only 
queries: imagine the query //listitem//keyword on a 
tree with many listitem nodes but only a few keyword nodes. 
We can start bottom-up by jumping to the keyword nodes 
and then checking their ancestors for listitem nodes. 
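
The decision and the upward check can be sketched as follows. This is a simplified Python illustration rather than the automaton-based procedure: `choose_strategy` compares the text match count with the smallest tag count (one possible reading of the criterion above), and `keyword_results` walks the parent chain of a matching text node directly. The `parent` and `tag` dictionaries and both function names are assumptions of this example.

```python
def choose_strategy(text_matches, tag_counts):
    """Start from the text matches when they are rarer than the rarest tag."""
    return "bottom-up" if text_matches < min(tag_counts) else "top-down"


def keyword_results(text_node, parent, tag):
    """Keyword ancestors of text_node that themselves have a listitem ancestor."""
    chain = []                         # ancestors, from nearest to the root
    n = parent.get(text_node)
    while n is not None:
        chain.append(n)
        n = parent.get(n)
    return [k for i, k in enumerate(chain)
            if tag[k] == "keyword"
            and any(tag[a] == "listitem" for a in chain[i + 1:])]


# site(0)( listitem(1)( keyword(2)( #3 ) ), keyword(4)( #5 ) )
parent = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}
tag = {0: "site", 1: "listitem", 2: "keyword", 3: "#", 4: "keyword", 5: "#"}

strategy_rare = choose_strategy(2, [1000, 5000])
strategy_common = choose_strategy(2000, [1000, 5000])
hit = keyword_results(3, parent, tag)    # text node below listitem/keyword
miss = keyword_results(5, parent, tag)   # text node below a top-level keyword
```

With two text matches against thousands of listitem and keyword nodes, the bottom-up strategy is chosen, and only the path from each match to the root is inspected.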




Fig. 3. Illustration of the bottom-up run 



To achieve this goal, we devise a real bottom-up evaluation 
algorithm of an automaton. The algorithm takes an 
automaton and a sequence of potential matching nodes (in our 
example, the text nodes containing the string "Unique"). 
It then moves up to the root, using the parent function, 
and checks that the automaton arrives at the root node in its 
initial state. This scheme is illustrated in Figure 3. The 
technique used is similar to shift-reduce parsing. Consider 
a sequence $[t_1, \ldots, t_n]$ (ordered in pre-order) of potentially 
matching subtrees. In our previous example these were text 
nodes but this is not a necessary condition. The algorithm 
starts on tree $t_1$. First, if the tree is not a leaf, we call the 
top_down_run function on $t_1$ with $r = \mathcal{Q}$. This returns the 
mapping $\mathcal{R}_1$ of all states accepting $t_1$. We now want to move 
up to the root from $t_1$ in state $\mathrm{dom}(\mathcal{R}_1)$ by taking the 
transitions upward. As illustrated in Figure 3, however, we do 
not want to move blindly from $t_1$ to the root. Indeed, once we 
arrive at a node $t'_1$ which is an ancestor of the next potential 
matching subtree $t_2$, we stop at $t'_1$ and start the algorithm 
on $t_2$ until it reaches the lowest common ancestor $t'_1$. Once this 
is done, we can merge both mappings and continue upward 
from $t'_1$ until we reach the root or a common ancestor of $t'_1$ 
and $t_3$, and so on. The idea of merging the runs at the lowest 
common ancestor makes sure that we never touch any node 
more than once in a bottom-up move. We now give the 
bottom-up algorithm formally. 

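Algorithm 5.2 (Bottom-up run function): 
Input: $\mathcal{A}$, $s$   Output: $\mathcal{R}$ 

where $\mathcal{A}$ is an automaton, $s$ a sequence of subtrees of the input tree, 
and $\mathcal{R}$ a mapping from states of $\mathcal{A}$ to subtrees of the input tree. 

 1 function bottom_up_run $\mathcal{A}$ $s$ = 
 2   if $s$ = [] then return $\emptyset$ 
 3   else 
 4     let $t, s'$ = hd($s$), tl($s$) in 
 5     let $\mathcal{R}$ = top_down_run $\mathcal{A}$ $t$ $\mathcal{Q}$ in 
 6     let $\mathcal{R}', s''$ = match_above $\mathcal{A}$ $t$ $s'$ $\mathcal{R}$ # 
 7     in 
 8     $\mathcal{R}' \cup$ (bottom_up_run $\mathcal{A}$ $s''$) 

 9 function match_above $\mathcal{A}$ $t$ $s$ $\mathcal{R}_1$ stop = 
10   if $t$ = stop then $\mathcal{R}_1, s$ 
11   else 
12     let $pt$ = Parent($t$) in 
13     let $\mathcal{R}_2, s'$ = 
14       if $s$ = [] or not (IsAncestor($pt$, hd($s$))) 
15       then $\emptyset, s$ 
16       else 
17         let $t_2, s'$ = hd($s$), tl($s$) in 
18         let $\mathcal{R}$ = top_down_run $\mathcal{A}$ $t_2$ $\mathcal{Q}$ in 
19         match_above $\mathcal{A}$ $t_2$ $s'$ $\mathcal{R}$ $pt$ 
20     in 
21     let trans = $\{ q, \ell \to \phi \mid \exists q' \in \mathrm{dom}(\mathcal{R}_i) \text{ s.t. } \downarrow_i q' \in \phi, \ \mathrm{label}(pt) \in \ell \}$ 
22     in 
23     let $\mathcal{R}'$ = $\{ q \mapsto R \mid \mathcal{R}_1, \mathcal{R}_2, pt \vdash_{\mathcal{A}} \phi = (\top, R), \forall (q, \ell \to \phi) \in \mathit{trans} \}$ 
24     in 
25     match_above $\mathcal{A}$ $pt$ $s'$ $\mathcal{R}'$ stop 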



In the light of Figure 3 it is easy to understand the two 
functions defined in Algorithm 5.2. The first one iterates 
the auxiliary function match_above on every tree in the 
sequence $s$. The match_above function is the one "climbing 
up" the tree. We assume that the Parent(_) function returns the 
empty tree when applied to the root node. If the input tree is 
not equal to the tree stop (which is initially the empty tree #, 
allowing to stop only after the root node has been processed) 
then we first check whether the next potential tree (we use the 
functions hd and tl, which return the first element of a list and 
its tail) is a descendant of our parent (Line 14). If it is 
so, then we pause for the current branch and recursively call 
match_above with our parent as stop tree. Once it returns, we 
compute all the possible transitions that the automaton can take 
from the parent node to arrive on the left and right subtree 
with the correct configuration (Line 21). Once this is done, 
we merge both configurations using the same computation as in 
the top-down algorithm (Line 23). Finally, we recursively call 
match_above on the parent node, with the new configuration 
and sequence of potential matching nodes (Line 25). 
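
The control flow of the two functions can be illustrated in isolation. The Python sketch below keeps only the climbing and pausing logic and records which nodes are touched; the automaton mappings, the transition computation and the calls to top_down_run are deliberately left out, so this is a simplification and not Algorithm 5.2 itself. The tree is a binary tree given by a `parent` dictionary, `None` plays the role of the empty tree #, and the names `bottom_up_visits` and `in_subtree` are hypothetical.

```python
def in_subtree(parent, a, x):
    """True if x lies in the subtree rooted at a (a itself included)."""
    while x is not None:
        if x == a:
            return True
        x = parent.get(x)
    return False


def bottom_up_visits(parent, candidates):
    """Return the nodes touched while climbing from the candidates to the root."""
    touched = []

    def match_above(t, s, stop):
        if t == stop:
            return s
        touched.append(t)
        pt = parent.get(t)             # None for the root
        if s and pt is not None and in_subtree(parent, pt, s[0]):
            # Pause this branch and climb from the next candidate up to pt.
            s = match_above(s[0], s[1:], pt)
        # Both branches have reached pt: continue upward from there.
        return match_above(pt, s, stop)

    s = list(candidates)
    while s:
        s = match_above(s[0], s[1:], None)
    return touched


# Binary tree: 0 has children 1 and 2; 1 has children 3 and 4; 2 has children 5 and 6.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
touched = bottom_up_visits(parent, [3, 4, 6])   # candidates in pre-order
```

In this run the branches from nodes 3 and 4 are merged at node 1, the branch from node 6 joins at the root, and node 5, which is on no candidate's path, is never visited.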

VI. Experimental Results 

We have implemented a prototype XPath evaluator based 
on the data structures and algorithms presented in previous 
sections. Both the tree structure and the FM-Index were 
developed in C++, while the XPath engine was written using 
the Objective Caml language. 

A. Protocol 

To validate our approach, we benchmarked our implementation 
against two other well-established XQuery implementations, 
namely MonetDB/XQuery and Qizx/DB. We describe 
our experimental settings hereafter. 

Test machine: Our test machine features an Intel Core2 
Xeon processor at 3.6 GHz, 3.8 GB of RAM and an S-ATA 
hard drive. The OS is a 64-bit version of Ubuntu Linux. The 
kernel version is 2.6.27 and the file system used to store the 
various files is ext3, with default settings. All tests were run 
on a minimal environment where only the tested program and 
essential services were running. We used the standard compiler 
and libraries available on this distribution (namely g++ 4.3.2, 
libxml2 2.6.32 for document parsing and OCaml 3.11.0). 



Qizx/DB: We used version 3.0 of the Qizx/DB engine (free 
edition), running on top of the 64-bit version of the JVM 
(with the -server flag set as recommended in the Qizx user 
manual). The maximal amount of memory of the JVM was set 
to the maximal amount of physical memory (using the -Xmx 
flag). We also used the flag -r of the Qizx/DB command 
line interface, which allows us to re-run the same query 
without restarting the whole program (this ensures that the 
JVM's garbage collector and thread machinery do not impact 
the performance). We used the timing provided by Qizx 
debugging flags, and reported the serialization time (which 
actually includes the materialization of the results in memory 
and the serialization). 

MonetDB/XQuery: We used version Feb2009-SP2 of 
MonetDB, and in particular, version 4.28.4 of the MonetDB4 
server and version 0.28.4 of the XQuery module (pathfinder). 
We used the timing reported by the "-t" flag of the MonetDB 
client program, mclient. We kept the materialization time 
and the serialization time separate. 

Running times and memory reporting: For each query, we 
kept the best of five runs. For Qizx/DB, each individual run 
consists of two repeated runs ("-r 2"), the second one being 
always faster. For MonetDB, before each of the five runs, the 
server was exited properly and restarted. We monitored the 
memory usage by reading, every 200 ms during the duration 
of the tested program, the /proc/pid/statm pseudo-file 
provided by Linux. More specifically, we monitored the so- 
called resident set size, which corresponds to the amount of 
process memory actually mapped in physical memory. For 
MonetDB, we kept track of the memory usage of both server 
and client. The peak of memory reported was the sum of 
client's peak plus server's peak. 
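
A monitoring loop of this kind can be sketched in Python as follows. It is only an illustration of the measurement described above, not the script used for the experiments; the function names and the `still_running` callback are assumptions. The second field of /proc/pid/statm is the resident set size counted in pages, so it is multiplied by the page size.

```python
import os
import time


def rss_bytes_from_statm(statm_line, page_size):
    """Resident set size in bytes from the contents of /proc/<pid>/statm."""
    return int(statm_line.split()[1]) * page_size


def peak_rss(pid, still_running, interval=0.2):
    """Sample the resident set size every `interval` seconds and keep the peak."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    peak = 0
    while still_running():
        try:
            with open(f"/proc/{pid}/statm") as f:
                peak = max(peak, rss_bytes_from_statm(f.read(), page_size))
        except FileNotFoundError:
            break                      # the process has exited
        time.sleep(interval)
    return peak


# 250 resident pages of 4096 bytes each.
example = rss_bytes_from_statm("1000 250 30 4 0 200 0", 4096)
```

For a client/server system such as MonetDB, one such loop per process and the sum of the two peaks would correspond to the figure reported here.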

For the tests where serialization was involved, we serialized 
to the /dev/null device (that is, all the results were 
discarded without causing any output operation). 

B. Indexing 

Our implementation features a versatile index. It is divided 
into three parts. First, the tree representation composed of the 
parenthesis structure, as well as the tag structure. Second, the 
FM-Index encoding the text collection. Third, the auxiliary 
text representation allowing fast extraction of text content. 

It is easy to determine from the query which parts of 
the index are needed in order to solve it, and thus load 
only those into main memory. For instance, if a query only 
involves tree navigation, then having the FM-Index in memory 
is unnecessary. On the other hand, if we are interested in 
very selective text-oriented queries, then only the tree part 
and FM-Index are needed (both for counting and serializing 
the results). In this case, serialization is a bit slower (due to 
the cost of text extraction from the FM-Index) but remains 
acceptable since the number of results is low. 

Figure 4 shows the construction time and the memory used 
during the indexing process. For these indexes, a sampling 
factor $l = 64$ (cf. Section III) was chosen. As we see, the 
whole index, including tree structure, FM-index and auxiliary 
text representation, is always smaller than twice the size of the 
original document. Since it is always possible to choose which 
text representation to use, the actual main-memory footprint 
of the index is close to the original document size. 

Fig. 4. Indexing of XMark documents 

Q01 /site/regions 
Q02 /site/closed_auctions 
Q03 /site/regions/europe/item/mailbox/mail/text/keyword 
Q04 /site/closed_auctions/closed_auction/annotation/description/parlist/listitem 
Q05 /site/closed_auctions/closed_auction/annotation/description/parlist/listitem/parlist/listitem/*//keyword 
Q06 /site/regions/*/item 
Q07 //listitem//keyword 
Q08 /site/regions/*/item//keyword 
Q09 /site/regions/*/person[ address and (phone or homepage) ] 
Q10 //listitem[.//keyword and .//emph]//parlist 
Q11 /site/regions/*/item[ mailbox/mail/date ]/mailbox/mail 
Q12 /*[ descendant::* ] 
Q13 //* 
Q14 //*//* 
Q15 //*//*//*//* 
Q16 //*//*//*//*//*//*//*//* 

Fig. 5. Tree oriented queries 

C. Tree queries 

We benchmarked tree queries using the queries given in 
Figure 5. Queries Q01 to Q11 were taken from the XPathMark 
benchmark [35], derived from the XMark XQuery benchmark 
suite. Q12 to Q16 are "crash tests" that are either simple (Q12 
selects only the root since it always has at least one descendant 
in our files) or generate the same amount of results but with 
various intermediate result sizes. For this experiment we used 
XMark documents of size 116MB and 1GB. In the cases of 
MonetDB and Qizx, the files were indexed using the default 
settings. Figure 6 reports the running times for both counting 
and materialization+serialization. We report in Figure 7 the 
peak memory use for each query, for the 116MB document. 

From the results of Figure 6 we see how the different 
components of SXSI contribute to the efficient evaluation 
model. First, queries Q01 to Q06, which are fully qualified 
paths, illustrate the sheer speed of the tree structure and 
in particular the efficiency of its basic operations (such as 
FirstChild and NextSibling, which are used for the child 
axis), as well as the efficient execution scheme provided by 
the automaton. Queries Q07 to Q11 illustrate the impact of 
jumping. Moreover, they show that filters do not impact the 
execution speed: the conditions they express are efficiently 
checked by the formula evaluation procedure. Finally, Q12 
to Q16 illustrate the robustness of our automata model. Indeed, 
while such queries might seem unrealistic, the good 
performance that we obtain is only the consequence of 
using an automata model, which factors in its states all the 
necessary computation and thus does not materialize unneeded 
intermediate results. This, coupled with the compact 
dynamic set of Section IV-D, allows us to keep a very low 
memory footprint even when the query generates many results 
or when each step has many intermediate results (cf. Figure 7). 

D. Text queries 

We tested the text capabilities of our XPath engine against 
the most advanced text oriented features of the other query engines. 

Qizx/DB: We used the newly introduced Full-Text extension 
of XQuery available in Qizx/DB v. 3.0. We tried to write 
queries as efficiently as possible while preserving the same 
semantics as our original queries. The queries we used always 
gave better results than their pure XPath counterparts. In particular, 
we used the ftcontains text predicate, introduced in 
[36] and implemented by Qizx/DB. The ftcontains predicate 
allows one to express not only contains-like queries but 
also Boolean operations on text predicates, regular expression 
matching and so on. It is more efficient than the standard 
contains. In particular we used regular expression matching 
in lieu of the starts-with and ends-with operators 
since the latter were slower in our experiments. 

T1 //MedlineCitation//*/text()[contains(., "brain")] 
T2 //MedlineCitation//Country/text()[contains(., "AUSTRALIA")] 
T3 //Country/text()[contains(., "AUSTRALIA")] 
T4 //*/text()[contains(., "1930")] 
T5 //MedlineCitation//*/text()[contains(., "1930")] 
T6 //MedlineCitation/Article/AuthorList/Author/LastName/text()[starts-with(., "Bar")] 
T7 //MedlineCitation[MedlineJournalInfo/Country/text()[ends-with(., "LAND")]] 
T8 //*[Year = "2001"] 
T9 //*[LastName = "Nguyen"] 

Fig. 8. Text oriented queries 

MonetDB: MonetDB supports some full-text capabilities 
through the use of the PF/Tijah text index [37]. While more 
efficient than using the built-in string functions, the set of 
queries expressible with this index is quite limited. The tree 
navigation part is limited to the descendant and self axes, 
while the only text predicate available is about, which allows 
selecting nodes which are "relevant" with respect to a given 
string. We used it to express contains and used the built-in 
string functions for other queries. 

Experiments were made against a 122MB Medline file. This 
file contains bibliographic information about life sciences 
and biomedical publications. This test file featured 5,732,159 
text elements, for a total amount of 95MB of text content. 
Figure 8 shows the text queries we tested. We used count 
queries for both MonetDB and Qizx (enclosing the query 
in a fn:count() call), while in our implementation 
we ran the queries in "materialization" mode but without 
serializing the output. The table in Figure 9 summarizes the 
running times for each query. As we target very selective text 
queries, we also give, for each query, the number of results 
it returned. Since for these queries our automata worked in 
"bottom-up" mode, we detail the two following operations: 

• Calling the text predicate globally on the text collection, 
thus retrieving all the probable matches of the query (Text 
query line in the table of Figure 9). 

• Running the automaton bottom-up from the set of probable 
matches to keep those satisfying the path expression 
(Auto. run line in the table of Figure 9). 

As is clear from the experiments, the bottom-up strategy 
pays off. The only downside of this approach is that the 
automaton uses Parent moves, which are less efficient than 
FirstChild and NextSibling. This is clear in queries T7 and 
T8, where the increase in the number of results makes the relative 
slowness of the automaton more visible. However, our evaluator 
still outperforms the other engines even in those cases. 

E. Remarks 

We also compared with Tauro [3]. Yet, as it uses a tailored 
query language, we could not produce comparable results. 

We limited the experiments to natural language XML, 
although our engine (unlike the inverted-file-based engines) 
also supports queries on XML databases of continuous 
sequences such as DNA and proteins. Realistic queries on such 
biosequence XMLs require approximate and regular expression 
search functionalities, which we already support but whose 
experimental study is out of the scope of this paper. 

VII. Conclusions and Future Work 

We have presented SXSI, a system for representing an XML 
collection in compact form so that fast indexed XPath queries 
can be carried out on it. Even in its current prototype stage, 
SXSI is already competitive with well-known efficient systems 
such as MonetDB and Qizx. As such, a number of avenues 
for future work are open. We mention the broadest ones here. 

Handling updates to the collections is possible in principle, 
as there are dynamic data structures for sequences, trees, and 
text collections [7], [8], [13]. What remains to be verified is 
how practical those theoretical solutions can be made. 

As seen, the compact data structures support several fancy 
operations beyond those actually used by our XPath evaluator. 
A matter of future work is to explore other evaluation strate- 
gies that take advantage of those nonstandard capabilities. As 
an example, the current XPath evaluator does not use the range 
search capabilities of structure Doc of Section III. 

A clear direction for future work is to extend the current 
system to support XQuery operations. Even within full XPath 
1.0 there are very sophisticated primitives such as data joins, 
which would be challenging to support efficiently. 